Last updated: 2019-09-11

This tutorial is a joint product of the Statnet Development Team:

Martina Morris (University of Washington)
Pavel N. Krivitsky (University of Wollongong)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Skye Bender de-Moll (Oakland)

For general questions and comments, please refer to the statnet wiki and the statnet users group and mailing list
http://statnet.org/statnet_users_group.shtml

Introduction to this workshop/tutorial.

This workshop and tutorial provide an introduction to key features of the packages that the statnet suite provides for statistical modeling of network data with Exponential family Random Graph Models (ERGMs). The packages we will be using are:

  • network – storage and manipulation of network data
  • sna – descriptive statistics and graphics for exploratory network analysis`
  • ergm – statistical tools for estimating ERGMs, model assessment, and network simulation.

All statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub.

Prerequisites

This workshop assumes basic familiarity with R, experience with network concepts, terminology and data, and familiarity with the general framework for statistical modeling and inference.

The workshops are conducted using Rstudio.

1. Software Installation and Preliminaries

Open an R session, and set your working directory to the location where you would like to save this work.

To install the package the ergm.ego,

This will install all of the “dependencies” – the other R packages that ergm.ego needs.

We will also need two other packages (from the “tidyverse”) for this workshop:

Load these packages

Add define some functions we’ll use later

Check package versions

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_2.1.3 dplyr_0.8.3  ergm.ego_0.5 ergm_3.10.4  network_1.15 knitr_1.24  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2           magrittr_1.5         MASS_7.3-51.4        tidyselect_0.2.5    
 [5] lattice_0.20-38      R6_2.4.0             rlang_0.4.0          stringr_1.4.0       
 [9] tools_3.6.1          parallel_3.6.1       grid_3.6.1           lpSolve_5.6.13.2    
[13] xfun_0.8             coda_0.19-3          htmltools_0.3.6      assertthat_0.2.1    
[17] yaml_2.2.0           digest_0.6.20        crayon_1.3.4         Matrix_1.2-17       
[21] purrr_0.3.2          trust_0.1-7          robustbase_0.93-5    glue_1.3.1          
[25] evaluate_0.14        rmarkdown_1.14       statnet.common_4.3.0 stringi_1.4.3       
[29] DEoptimR_1.0-8       compiler_3.6.1       pillar_1.4.2         pkgconfig_2.0.2     

2. Overview of ergm.ego

The ergm.ego package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.

This dramatically lowers the bar for empirical research on networks. In many (most?) empirical contexts, it is not possible to collect a network census or even an adaptive (link-traced) sample. Even when one of these may be possible in practice, egocentrically sampled data are typically cheaper and easier to collect.

Long regarded as the poor country cousin in the network data family, egocentric data contain a remarkable amount of information. With the right statistical methods, such data can be used to explore the properties of the complete networks in which they are embedded. The basic idea here will be familiar to anyone who has worked with survey data: you combine what is observed (the data) with assumptions (the model terms and their sampling distributions), to define a class of models (the range of coefficient values).
Once estimated, the fitted model represents a distribution of networks that is centered on the observed properties of the (appropriately scaled) sampled network. The stochastic variation in the network distribution quantifies some of the uncertainty introduced by the assumptions.

The ergm.ego package comprises:

  • a set of utilities to manage the data,
  • a set of “ergm terms” that can be used in models,
  • and a set of functions for estimation and inference that rely largely on the existing ergm package, but include the specific modifications needed in the egocentric data context.

ergm.ego is designed to work with the other statnet packages. So, for example, once you have fit a model, you can use the summary and diagnostic functions from ergm to evaluate the model fit, the ergm simulate function to simulate complete network realizations from the model, the network descriptives from sna to explore the properities of networks simulated from the model, and you can use other R functions and packages as well after converting the network data structure into a data frame.

Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.

2a. Motivating example: Analyzing racial disparities in HIV in the US

The work on ergm.ego was originally motivated by a specific question in the field of HIV epidemiology—Does network structure help explain the persistent racial disparities in HIV prevalence in the United States?

An African American today is 10 times more likely than a white American to be living with HIV/AIDS. The disparity begins early in life, persists through to old age, and is evident among all risk groups: heterosexuals, men who have sex with men (MSM), and injection drug users. The disproportionate risks faced by heterosexual African-American women are especially steep. In 2010, an African-American woman was over 40 times more likely to be diagnosed with HIV than a heterosexual white man (Figure 1).

Figure 1

Figure 1

Empirical studies repeatedly find that these disparities cannot be explained by individual behavior, or biological differences.

A growing body of work is therefore focused on the role of the underlying transmission network. This network can channel the spread of infection in the same way that a transportation network channels the flow of traffic, with emergent patterns that reflect the connectivity of the system, rather than the behavior of any particular element.

Descriptive analyses and simulation studies have focused attention on two structural features: homophily and concurrency. Homophily is the strong propensity for within-group partner selection. Concurrency is non-monogamy—having partners that overlap in time–which increases network connectivity by allowing for the emergence of stable network connected components larger than dyads (pairs of individuals).

The hypothesis is that these two network properties together can produce the sustained HIV/STI prevalence differentials we observe: differences in concurrency between groups are the mechanism that generates the prevalence disparity, while homophily is the mechanism that sustains it.

We will never observe the complete dynamic sexual network that transmits HIV. But ergm.ego allows us to test the network hypothesis with egocentrically sampled data–and we will demonstrate that here using data collected by the National Health and Social Life Survey from 1994. The analysis comes from a recent paper (Krivitsky and Morris, 2017).

First, ergm.ego allows us to assess whether empirical patterns of homophily and concurrency are in the predicted directions and statistically significant. We do this in the usual way – comparing sequential model fits with terms that represent the hypotheses of interests, and t-tests for their coefficients. We will discuss these terms in more detail in later sections, but here we test the concurrency effects with “monogamy bias” terms.

Table 1

Table 1

Result: Yes, the homophily and concurrency effects are in the predicted directions and statistically significant.

Next, we can assess the goodness of fit of each model in the way we usually do in ERGMs, by checking whether the models reproduce observed nework properties that are not in the model. We do this here by simulating from each model and comparing the fits to the full observed degree distribution:

Figure 2

Figure 2

Result: Only Model 3 (with both hypothesized nework effects) is able to reproduce the observed degree distribution.

The ability to simulate complete networks from the model, however, allows us to do much more–we can now examine the connectivity in the overall network that each of these models would generate. For example, we can examine the component size distributions under each model:

Figure 3

Figure 3

Result: Model 3, with its “monogamy bias” dramatically reduces the right skew of the component size distribution, and places most people in components of size 2, or 3 if they are in a larger component.

Finally, we can define a measure of “network exposure” that represents the signature feature of a network effect: indirect exposure to HIV via a partner’s behavior, rather than direct exposure via one’s own behavior. One metric for network exposure is the probability of being in a component of size 3 or more. Because this is a node-level metric, we can break it down by race and sex for each of the three models:

Figure 3

Figure 3

Result: Only model 3 produces a pattern of network exposure that is consistent with the observed disparities in HIV incidence.

ergm.ego provides a powerful analytic framework that uses extremely limited network data and testable models to investigate the unobservable patterns of complete network connectivity that are consistent with the sampled data.

The principles of egocentric inference can be extended to temporal ERGMs (TERGMs). While we will not cover that in this workshop, an example can be found in another paper that sought to evaluate the network hypothsis for racial disparities in HIV in the US (Morris et al. 2009). In that paper, egocentric data from the National Longitudinal Survey of Adolescent Health (AddHealth) was analyzed, and an example of the resulting dynamic complete network simulation (on 10,000 nodes) can be found in this “network movie”.

The movie below is another simpler example of what you can do with egocentrically sampled data and the statnet software. It shows an epidemic spreading on a small dynamic contact network that is simulated with a STERGM estimated from egocentrically sampled network data. The movie was produced by the R packages EpiModel and ndtv, which are part of the statnet package suite (and are also available on CRAN).

3. Background

The full technical details on ERGM estimation and inference from egocentrically sampled data have been published in Krivitsky and Morris (2017). This section of the tutorial provides a brief introduction to the key concepts.

3a. Exponential-family random graph models

ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can—among other tasks—obtain maximum-likehood estimates (MLEs) for the parameters of a specified model for a given data set; test individual models for goodness-of-fit, perform various types of model comparison; and simulate additional networks with the underlying probability distribution implied by that model.

The general form for an ERGM can be written as: \[ P(Y=y;\theta,x)=\frac{\exp(\theta^{\top}g(y,x))}{\kappa(\theta,x)}\qquad (1) \] where:

  • \(Y\) is the random variable for the state of the network (with realization \(y\)),

  • \(g(y,x)\) is a vector of model statistics for network y,

  • \(\theta\) is the vector of coefficients for those statistics, and

  • \(\kappa(\theta)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).

If you’re not familiar with the compact notation here, note that the numerator represents a formula that is linear in the log form:

\[ log({\exp(\theta'g(y))}) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.

The ERGM expression for the probability of the entire graph shown above can be re-expressed in terms of the conditional log-odds of a single tie between two actors:

\[ \operatorname{logit}{(Y_{ij}=1|y^{c}_{ij})=\theta'\delta(y_{ij})} \]

where

  • \(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and

  • \(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. all dyads in the network other than \(y_{ij}\).

  • \(\delta(y_{ij})\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled on or off. So:

\[ \delta(y_{ij}) = g(y^{+}_{ij})-g(y^{-}_{ij}) \]

where

  • \(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and
  • \(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.

So \(\delta(y_{ij})\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(g(y)\).

This expression shows that the coefficient \(\theta\) can be interpreted as that term’s contribution to the log-odds of an individual tie, conditional on all other dyads remaining the same. The coefficient for each term in the model is multiplied by the number of configurations that tie will create (or remove) for that specific term.

The model terms \(g(y,x)\) are functions of network statistics that we hypothesize may be more or less common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, the statistics one can include in the model are limited by the requirement that they can be observed in the sample data more details below

A key distinction in model terms is dyad independence or dyad dependence. Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are dyad dependent.

3b. Network Sampling

Network data are distinguished by having two units of analysis: the actors and the links between the actors. This gives rise to a range of sampling designs that can be classified into two groups: link tracing designs (e.g., snowball and respondent driven sampling) and egocentric designs.

3b2. Egocentric Designs

Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample of respondents (“egos”) who, via interview, are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”), and then asked to provide information on the characteristics of the alters and/or the ties. The alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.

Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.

For the moment ergm.ego uses the minimal egocentric network study design, in which alters cannot be uniquely identified and alter matrices are not collected The minimal design is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs which include identifiable alter matrices. However, deveopment of estimation where alter–alter matrices are available is being planned.

3c. Existing methods for sampled network data

Model-based

Handcock and Gile (2010): Likelihood inference for partially observed networks, has egocentric data as a special case.

Kosikinen and Robins (2010): Bayesian inference for partially observed networks, has egocentric data as a special case.

Pros:
  • Can fit any ERGM that can be identified.
  • Can handle link-tracing designs.
Cons:
  • Requires alters to be identifiable.
  • Cannot take into account sampling weights (unless all attributes that affect sampling weights are part of the model).
  • Might not scale.
  • Requires knowledge of the population distribution of actor attributes used in the model.

Design-based

Krivitsky and Morris (2017) Use design-based estimators for sufficient statistics of the ERGM of interest and then transfers their properties to the ERGM estimate.

Pros:
  • Does not require alters to be identifiable.
  • Borrows directly from design-based inference methods. (Can easily incorporate sampling weights, stratification, etc.)
  • Can fit any ERGM that can be identified (though see below).
  • Can be made invariant to network size for some models.
Cons:
  • Requires “reimplementation” of the model statistics as “EgoStats”: currenly does not support alter–alter statistics or directed or bipartite networks.
  • Relies on independent sampling form population of interest in some form.
  • Cannot be fit to more complex (e.g., RDS) designs.
  • Requires knowledge of the population distribution of actor attributes used in the model.

4. Theoretical Framework and Definitions

We’ll need some notation for this (sorry, and a warning that it will get hairier).

Population network

Parameter Meaning
\(N\) the population being studied: a very large, but finite, set of actors whose relations are of interest
\(x _ i\) attribute (e.g., age, sex, race) vector of actor \(i \in N\)
\(x_N\) (or just \(x\), when there is no ambiguity) the attributes of actors in \(N\)
\(\mathbb{Y}(N)\) the set of dyads (potential ties) in an undirected network of actors in \(N\)
\(y\subseteq \mathbb{Y}(N)\) the population network: a fixed but unknown network (a set of relationships) of relationships of interest. In particular,
\(y_{ij}\) an indicator function of whether a tie between \(i\) and \(j\) is present in \(y\)
\(y _ i=\{j\in N: y _ {ij}=1\}\) the set of \(i\)’s network neighbors.

Egocentric sample

Parameter Meaning
\(e_{N}\) the egocentric census, the information retained by the minimal egocentric sampling design when all nodes are sampled
\(S\subseteq N\) the set of egos in a sample
\(e_{S}\) the data contained in an egocentric sample
\(e_i\) the “egocentric” view of network \(y\) from the point of view of actor \(i\) (“ego”), with the following parts:
\(e^e_i \equiv x_i\) \(i\)’s own attributes
\(e^a_i \equiv (x_{j})_{j\in y_i}\) an unordered list of attribute vectors of \(i\)’s immediate neighbors (“alters”), but not their identities (indices in \(N\))
\(e^e_{i,k}\equiv x_{i,k}\) The \(k\)th attribute/covariate observed on ego \(i\)
\(e^a_{i,k}\equiv( x_{j,k})_{j\in y_i}\) and its alters.

4a. Specifying Egocentric ERGMs

Egocentric ERGMs are specified the same way they would be in ergm: via terms (e.g. nodematch) used to represent predictors on the right-hand size of equations used in calls to:

  • summary (to obtain measurements of network statistics on a dataset)
  • ergm.ego (to estimate an ERGM)
  • simulate (to simulate networks from an ERGM fit)

The terms that can be used in an ERGM depend on the type of network being analyzed (directed or undirected, one-mode or two-mode (“bipartite”), binary or valued edges) and on the statistics that can be observed in the sample.

Even if the whole population is egocentrically observed (i.e., \(S=N\), a census), the alters are still not uniquely identifiable. This limits the kinds of network statistics that can be observed, and the ERGM terms that can be fit to such data. We turn to the notion of sufficiency to identify the terms amenable to egocentric inference.

Egocentric Statistics

We call a network statistic \(g_{k}(\cdot,\cdot)\) egocentric if it can be expressed as \[ g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i) \] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.


In plain language, a statistic is egocentric if you can take the individual level measure of that statistic (say, the number of ties), sum this (or a simple function of this) across all individuals in the population (i.e., an egocentric census) and recover the network level statistic (here, the total number of ties in the network).


The set of egocentric statistics includes dyadic-independent statistics that can be expressed in the general form of \[ g_{k}(y,x)=\sum_{ij\in y} f_k(x_i,x_j) \] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some dyadic-dependent statistics that can be expressed as \[ g_{k}(y,x)=\sum_{i\in N} f_k ({x_{i},(x_j)_{j\in y_i}}) \] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.

The statistics that are identifiable in an egocentric sample depend on the specific egocentric study design:

Basic (minimal) egocentric design (alter attributes only)
  • Nodal Covariate/Factor effects
  • Homophily
  • Degree distribution
With ego reports of alter degree (the number of alter’s ties)
  • Degree assortativity
With alter-alter ties
  • Triadic closure (transitive/cyclical ties, triangles)
  • 4-cycles (possibly)
Not Egocentric for other reasons, but estimable
  • Mean degree (\(g_{k}(y,x)=2|y|/|N|\)): \(e _ i\) doesn’t know how big the network is 1

The table below (from Krivitsky & Morris 2017) shows some examples of egocentric statistics, and gives their representations in terms of of \(h_{k}(\cdot)\). Examples of egocentric statistics

4b. Theoretical basis for inference

The framework for inference relies on two basic properties of exponential family models:

  • For MLEs, the expected value of a sufficient statistic (\(g(y,x)\)) in the model is equal to its observed value.

  • The MLE is a smooth function of the sufficient statistic, and is defined for “in between” values of the statistics as well (e.g., fractional edges).

MLE’s for exponential family models uniquely maximize the probability of the observed statistics under the model (the “likelihood”). Any network with the same observed statistics will have the same probability.

Egocentric inference involves two steps: estimating the sufficient statistics \(g(y)\), and then using these to estimate the ERGM coefficients \(\theta\).

4.b1. Preliminaries: The scaling assumption for design based inference of the sufficient statistics

With an egocentric sample, we don’t observe these sufficient statistics, we observe a sampled version of them. So the first question is how the sample statistics scale up to the population level. If we know this, then we can estimate the population level sufficient statistics.

A natural scaling assumption for dyadic independent statistics is that they scale on a per capita basis, i.e., that the mean degree (degree per capita) stays constant as the network size changes. \[ \text{Mean Degree} = \frac{2\times\text{ties}}{\text{nodes}} = \frac{2T}{n} \] For example, if a network of 5 people has 10 ties, the mean degree (degree per capita) is \(\frac{2*ties}{nodes} = \frac{20}{5} = 4\), so we would expect a network of 20 people to have \(\frac{20*4}{2} = 40\) ties.
As with everything network, things are more complicated for dyad-dependent statistics. We will not cover the scaling for dyad-dependent statistics in this workshop, but if you are interested see Krivtsky and Kolaczyk (2015).

This type of estimate can easily be adapted to handle sample weights.

With this assumption, we can use a simple “design-based” estimator to estimate the sufficient statistics for our models from the observednsampled network statistics.

4.b2 Estimation and Inference for ERGMs from egocentric data

In a nutshell:

  1. Estimate the ERGM’s sufficient statistics using a “design-based” estimater that relies on the per capita scaling assumption \[g(y,x)\approx \tilde{g}(e _ S)=\frac{|N|}{|S|} \textstyle\sum_{i\in S} h _ {k}(e _ i).\]
  • This is an estimate for a population total, so it is consistent and asymptotically normal (by the CLT).
    • We can estimate its variance as we would the variance of a mean.
  1. Fit the ERGM by finding \(\hat{\theta}\) that corresponds to (that is, produces) \(\tilde{g}(e _ S)\).
    • It exists because of ERGM’s properties.
  2. Use Delta Method to transfer properties of \(\tilde{g}(e _ S)\) to \(\hat{\theta}\).
    • We can do this, because MLE is a smooth function of \(g(y,x)\).

This approch allow us to use any statistics that can be observed in an egocentric sample as a term in an ERG model, estimate the model from a complete pseudo-network that has the appropriately scaled sufficient statistics, and test the coefficients for statistical significance. With the fitted model, we can also simulate networks that we know will be centered on the (scaled) joint distribution of the observed statistics.

4b3. Practical issues

In practice, egocentric sample statistics generally need to be adjusted for network size and some types of observable discrepancies.

Network Size

The treatment of network size is perhaps the most obvious way that egocentric estimation differs from a standard ERGM estimation on a completely observed network. With a network census, the network size is known; by contrast, with a network sample, we don’t typically know the size of the network from which it is drawn. However, if the egos are a random sample from the population, and the network statistics we observe in the sample scale in a known way with network size, then we can adjust for this in the estimation. The resulting parameter estimates will be “size invariant”.

Here we will follow Krivitsky, Handcock, and Morris (2011), who showed that one can obtain a per capita size invariant parameterization for dyad-independent statistics by using an offset, approximately equal to \(-\log(n)\), where \(n\) is the number of nodes in the network.


A term with a known coefficient in statistical modeling is called an “offset”. In this case the coefficient on the \(-\log(n)\) offset is 1.


The intuition is that this transforms the density-based parameterization that is the default scale for ERGMs (ties per dyad) into a mean degree-based parameterization (ties per node):

\[ \text{Density} = \frac{\text{ties}}{\text{dyads}} = \frac{T}{\frac{n(n-1)}{2}} = \frac{\text{Mean Degree}}{(n-1)} \]

Once the number of edges is adjusted to preserve the mean degree, Krivitsky et al. show that all of the dyad independent terms are properly scaled. For degree-based terms, we would want, by analogy, for per capita invariance to preserve the degree probability distribution. Experimental results suggest that the mean-degree preserving offset has this property, but a mathematical proof is elusive.

Observable discrepancies

In the population level network, the number of ties between groups must balance: the number of undirected ties between group 1 and group 2 is the same as between group 2 and group 1, because we observe the group attribute of each node for every tie.

In an egocentrically sampled network, this equality is no longer guaranteed. Sampling variability and measurement error can both lead to discrepancies: tie subtotals that are required to balance in theory, but are observed not to balance in the sample.

This is another unique feature of egocentrically sampled network data.

In ergm.ego, we assume that any discrepancy is due to sampling variation, and effectively take the average of the discrepant reports to estimate the number of ties for that ego-alter tie configuration. If you know the source of the discrepancy in your specific dataset, or want to make a different assumption you may want to address this before fitting the data in ergm.ego.

Constructing the egocentric target statistics

Now that we have a simple way to construct a consistent set of egocentric statistics and implement a size-invariant parameterization, we can set up the target statistics needed for the ERGM estimation. In many applications, the population size is in the millions. Therefore, we use per capita scaling to limit the size of the network we use for estimation, and increase the speed and efficiency of the process.

We first scale the sample statistics to a pseudo-population size \(|N'|\). The resulting values are the target statistics that we pass to ergm. We can then fit whe model with an offset of \(-\log(|N'|)\), which will return the per capita estimates that can be easily rescaled to any value for simulation purposes.

Note that if we want to obtain population estimates from ergm.ego for a population with a known size \(|N|\), we can fit the model with an offset of \(\log(|N|/{|N'|})=\log(|N|) - \log(|N'|)\). This is what the parameters would have been had we fit the ERGM to the population of size \(|N|\).

4b4. Inference: Estimating the standard errors of the ERGM coefficients

The standard errors for coefficients in an ergm.ego fit are designed to represent the uncertainty in our estimate. For ERGMs, this uncertainty can be thought of as coming from three possible sources:

  1. A superpopulation of networks, from which this one network is a single realization: What other networks could have been produced by the social process of interest?
  2. The sampling process of egos: What other samples could have been drawn?
  3. The stochastic MCMC algorithm used to estimate the coefficient: What other MCMC samples could we have gotten?

Most treatments of ERGM estimation treat the coefficient \(\theta\) as a parameter of a superpopulation process of which \(y\) is a single realization. The variance of the MLE of \(\theta\) is then conceived as coming from (1) and (3) above.

In contrast, in ergm.ego we treat the network as a fixed, unknown, finite population, so it is not a source of uncertainty. Rather, uncertainty comes from sampling from this network, and from the MCMC algorithm, (2) and (3) above.

This makes ergm.ego inference much more like traditional (frequentist) statistical inference: we imagine repeatedly drawing an egocentric sample, and estimating the ERGM on each replicate. The sampling distribution of the estimate reflects how our estimate will vary from sample to sample.

5. The package ergm.ego

Since ergm.ego is essentially a wrapper around ergm, there are relatively few functions in the ergm.ego package itself. The functions that are there deal with the specific requirements associated with data management, estimation and inference for egocentrically sampled data.

To get a list of documented functions, type:

The main R objects unique to ergm.ego are:

  • egodata objects for storing the original data (the analog to network objects in ergm),
  • ergm.ego objects, which store the model fit results (the analog to ergm objects in ergm).

Once you simulate from the fit, the resulting objects are just network objects.

The set of functions can be divided into groups as follows:

5b. Model terms available for ergm.ego

As with ERGM terms, terms in ergm.ego are coded as functions in R. The possible terms in an ergm.ego model are inherently limited to the egocentric ones: statistics that can be inferred from an egocentric sample. In general, these will include variants on nodal covariate terms, mixing terms, and degree distribution terms. The terms actually available in ergm.ego are limited in addition to those that have been coded up. At this point this is a small subset (n=11) of the terms currently available for ergm, but they have the same names and arguments as their ergm counterparts.

For a list of terms currently included in the ergm.ego, type:

Dyad independent terms include density and vertex attribute based measures:

  • Density: edges
  • Vertex attribute effects: nodefactor (for discrete/nominal vars) and nodecov (for continuous)
  • Homophily: nodematch (for homophily), nodemix (for general mixing patterns) and absdiff

Dyad dependent terms include degree-based measures:

  • Degree: degree, degrange and degreepopularity
  • Concurrency: concurrent and concurrentties

6. Example Analysis

We will work with the faux.mesa.high dataset that is included with the ergm package, using the as.egodata function to transform it into an egocentrically sampled network. In essence, this creates an egocentrically sampled census from the network. Knowing the complete network will allow us to compare the fits we get from ergm.ego and ergm.

Preliminaries:

Set seed for simulations – this is not necessary, but it ensures that we all get the same results (if we execute the same commands in the same order).

6a. Reading in Data

Start by creating a network object from the faux.mesa.high data:

Let’s first take a quick look at the complete network

Now, let’s turn this into an egodata object:

Network does not have vertex attribute 'vertex.names' to use as ego ID; using 1..N.

Take a look at this object – there are several ways to do this:

List of 4
 $ egos    :'data.frame':   205 obs. of  4 variables:
  ..$ vertex.names: int [1:205] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ Grade       : num [1:205] 7 7 11 8 10 10 8 11 9 9 ...
  ..$ Race        : chr [1:205] "Hisp" "Hisp" "NatAm" "Hisp" ...
  ..$ Sex         : chr [1:205] "F" "F" "M" "M" ...
 $ alters  :'data.frame':   406 obs. of  4 variables:
  ..$ vertex.names: int [1:406] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ Grade       : num [1:406] 7 7 12 7 7 7 7 7 7 7 ...
  ..$ Race        : chr [1:406] "White" "NatAm" "Hisp" "NatAm" ...
  ..$ Sex         : chr [1:406] "F" "F" "F" "F" ...
 $ egoWt   : num [1:205] 1 1 1 1 1 1 1 1 1 1 ...
 $ egoIDcol: chr "vertex.names"
 - attr(*, "class")= chr "egodata"
         Length Class      Mode     
egos       4    data.frame list     
alters     4    data.frame list     
egoWt    205    -none-     numeric  
egoIDcol   1    -none-     character
  vertex.names Grade  Race Sex
1            1     7  Hisp   F
2            2     7  Hisp   F
3            3    11 NatAm   M
4            4     8  Hisp   M
5            5    10 White   F
6            6    10  Hisp   F
  vertex.names Grade  Race Sex
1            1     7 White   F
2            1     7 NatAm   F
3            1    12  Hisp   F
4            1     7 NatAm   F
5            1     7 White   F
6            1     7 NatAm   F

Note that the ego table contains the egoID, and the nodal attributes Race, Grade and Sex. The alter table also contains the egoID, an equivalent set of nodal attributes. The egoID column is used by functions that know how to work with egodata objects to match alters to the ego that nominated them. The alters do not have unique IDs (because we do not know if they are unique).

So that you get a chance to read in egodata from a .csv file, let’s write out the ego and alter tables, and then read them back in to create an egodata object:

Now read them back in:

  vertex.names Grade  Race Sex
1            1     7  Hisp   F
2            2     7  Hisp   F
3            3    11 NatAm   M
4            4     8  Hisp   M
5            5    10 White   F
6            6    10  Hisp   F
  vertex.names Grade  Race Sex
1            1     7 White   F
2            1     7 NatAm   F
3            1    12  Hisp   F
4            1     7 NatAm   F
5            1     7 White   F
6            1     7 NatAm   F

And to create an egodata object from them, we use the as.egodata function:

  vertex.names Grade  Race Sex
1            1     7  Hisp   F
2            2     7  Hisp   F
3            3    11 NatAm   M
4            4     8  Hisp   M
5            5    10 White   F
6            6    10  Hisp   F
  vertex.names Grade  Race Sex
1            1     7 White   F
2            1     7 NatAm   F
3            1    12  Hisp   F
4            1     7 NatAm   F
5            1     7 White   F
6            1     7 NatAm   F

We will explore some of the other functions available for manipulating the egodata object in a later section.

6b. Exploratory data analysis with ergm.ego

Prior to model specification, we can explore the data using descriptive statistics observable in the original egocentric sample. In general, the observable statistics are the same as those that ergm.ego can estimate.

We can use standard R commands to view nodal attribute frequencies:


  F   M 
 99 106 

Black  Hisp NatAm Other White 
    6   109    68     4    18 

To look at the mixing matrix, we use the same function we would use in ergm: mixingmatrix.
The function will return either the crosstabulation of the counts of ties by the attribute selected, or the conditional row probabilities.

    alter
ego    7  8  9 10 11 12
  7  150  0  0  1  1  1
  8    0 66  2  4  2  1
  9    0  2 46  7  6  4
  10   1  4  7 18  1  5
  11   1  2  6  1 34  5
  12   1  1  4  5  5 12
Note:  Marginal totals can be misleading
 for undirected mixing matrices.
    7  8  9 10 11 12
7  75  0  0  1  1  1
8   0 33  2  4  2  1
9   0  2 23  7  6  4
10  1  4  7  9  1  5
11  1  2  6  1 17  5
12  1  1  4  5  5  6
    alter
ego           7          8          9          10          11          12
  7  0.98039216 0.00000000 0.00000000 0.006535948 0.006535948 0.006535948
  8  0.00000000 0.88000000 0.02666667 0.053333333 0.026666667 0.013333333
  9  0.00000000 0.03076923 0.70769231 0.107692308 0.092307692 0.061538462
  10 0.02777778 0.11111111 0.19444444 0.500000000 0.027777778 0.138888889
  11 0.02040816 0.04081633 0.12244898 0.020408163 0.693877551 0.102040816
  12 0.03571429 0.03571429 0.14285714 0.178571429 0.178571429 0.428571429
       alter
ego          Black      Hisp     NatAm       Other      White
  Black 0.00000000 0.3076923 0.5000000 0.000000000 0.19230769
  Hisp  0.04494382 0.5955056 0.2303371 0.005617978 0.12359551
  NatAm 0.08333333 0.2628205 0.5897436 0.000000000 0.06410256
  Other 0.00000000 1.0000000 0.0000000 0.000000000 0.00000000
  White 0.11111111 0.4888889 0.2222222 0.000000000 0.17777778

We can also examine the observed number of ties, mean degree, and degree distributions.

[1] 406
[1] 1.980488
 degree0  degree1  degree2  degree3  degree4  degree5  degree6  degree7  degree8  degree9 degree10 
      57       51       30       28       18       10        2        4        1        2        1 
degree11 degree12 degree13 degree14 degree15 degree16 degree17 degree18 degree19 degree20 
       0        0        1        0        0        0        0        0        0        0 
 deg0.SexF  deg1.SexF  deg2.SexF  deg3.SexF  deg4.SexF  deg5.SexF  deg6.SexF  deg7.SexF  deg8.SexF 
        23         23         10         17         12          7          1          3          1 
 deg9.SexF deg10.SexF deg11.SexF deg12.SexF deg13.SexF  deg0.SexM  deg1.SexM  deg2.SexM  deg3.SexM 
         0          1          0          0          1         34         28         20         11 
 deg4.SexM  deg5.SexM  deg6.SexM  deg7.SexM  deg8.SexM  deg9.SexM deg10.SexM deg11.SexM deg12.SexM 
         6          3          1          1          0          2          0          0          0 
deg13.SexM 
         0 

For the degree distribution we used thesummaryfunction in the same way that we would use it in ergm with a network object. But the summary function also has an egodata specific argument, scaleto, that allows you to scale the summary statistics to a network of arbitrary size. So, for example, we can obtain the degree distribution scaled to a network of size 100,000, or a network that is 100 times larger than the sample.

   degree0    degree1    degree2    degree3    degree4    degree5    degree6    degree7    degree8 
27804.8780 24878.0488 14634.1463 13658.5366  8780.4878  4878.0488   975.6098  1951.2195   487.8049 
   degree9   degree10 
  975.6098   487.8049 
 degree0  degree1  degree2  degree3  degree4  degree5  degree6  degree7  degree8  degree9 degree10 
    5700     5100     3000     2800     1800     1000      200      400      100      200      100 

Note that the first scaling results in fractional numbers of nodes at each degree, because the proportion at each degree level does not scale to an integer for this population size. Again, this is not a problem for estimation, but one should be careful with descriptive statistics that expect integer values. The second scaling does result in integer counts because it is a multiple of the sample size.

We can plot the degree distribution using another generic function degreedist. As with the mixingmatrix function, this can return either the counts or the proportions at each degree.

The degreedist function also has an argument that lets you overplot the expected degree distribution for a Bernoulli random graph with the same expected density.

The brg overplot is based on 50 simulations of a Bernoulli random graph with the same number of nodes and expected density, implemented by using an ergm.ego simulation from an edges only model with \(\theta=\mbox{logit}(\mbox{probability of a tie})\) from the observed data. The overplot shows the mean and 2 standard deviations obtained for each degree value from the 50 simulations. Note that the brg automatically scales to the proportions when prob=T.

6c. Model Specification

The key characteristics we would like to capture in a model for this network are:

  • Variation in mean degree by nodal attributes (race, sex and grade)
  • Patterns of mixing by race, sex and grade
  • The degree distribution, (in particular the disproportionate isolate fraction)

We can use ergm.ego to fit a sequence of nested models to both estimate the parameters associated with these statistics, and test their significance. We can diagnose the both the estimation process (to verify convergence and good mixing in the MCMC sampler) and the fit of the model to the data. In both cases, we will use functionality that will be familiar to ergm users: MCMC diagnostics, and GOF.

One thing that is different from a standard ergm call is that we need to specify the scaling, both for the pseudo-population (\(N'\)) that will be used to set the target statistics during estimation, and for the population (N) size that the final rescaled coefficients will represent. Recall,

Population size \((|N|)\)
If we wanted to rescale the final estimates to reproduce the expected values in the true population network, we would need to know \(N\), the size of the popuation. In general, we do not know this, so the most useful scaling is a per capita scaling, which can easily be transformed into any value of \(|N|\) later (for simulation, or other purposes). For per capita scaling, \(|N|=1\). In ergm.ego, it is controlled by the popsize top-level argument.
Pseudo-population \((|N'|)\)
Recall from above that the target statistics can be scaled to an arbitrary population size for estimation. What size should we use? Several principles guide this choice:
  • Bias – In general, estimation bias is reduced the closer \(N'\) is to \(N\) (usually larger).

  • Computing time – The larger the pseudo-population, the longer the estimation takes.

  • Sample weights – In general, it is good practice for the smallest sample weight to produce at least 1 observation in the pseudo-population network, though more is better.

This leads to different guidelines for data with and without weights.

Simulation studies in Krivitsky & Morris (2017) suggest that a good rule of thumb is to have a minimum pseudo-population size of 1,000 for unweighted data. For weighted data the pseudo-populations size should be at least 1 * sampleSize/smallestWeight (or 3 * sampleSize/smallestWeight to be safe), or 1000 (whichever is larger).

In ergm.ego, \(|N'|\) is controlled by a combination of four factors:

  • the sample size \(|S|\) (i.e., number of egos),
  • the top-level argument popsize (\(|N|\) or 1) (default: 1),
  • the control.ergm.ego control parameter ppopsize (default: "auto"),
  • the control.ergm.ego control parameter ppopsize.mul (default: 1).

If ppopsize is left at its default ("auto"),

  • If popsize is left at 1, ppopsize \(=|S|\times\)ppopsize.mul.
  • If popsize is specified (as |N|), ppopsize \(=|N|\times\)ppopsize.mul.

You can also force one of these two regimes by setting ppopsize to "samp" or "pop", respectively, or set it to a number to force a particular \(|N'|\) ignoring ppopsize.mul.

For more information, see

We will give an example below.

In both cases, the scaling will only affect the estimate of the edges term, and we will demonstrate this below.

Let’s start with simple edges-only model to see what’s the same and what is different from a call to ergm:


==========================
Summary of model fit
==========================

Formula:   mesa.ego ~ edges

Iterations:  2 out of 20 

Monte Carlo MLE Results:
                    Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj) -5.32301    0.00000      0    -Inf   <1e-04 ***
edges                0.69739    0.07908      0   8.819   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

This is a simple model, for a homogenous tie probability – a Bernoulli random graph with the mean degree observed in our sampled data. The only difference in the syntax from standard ergm is the function call to ergm.ego. Let’s look under the hood at the components that are output to the fit.edges object:

 [1] "coef"             "sample"           "sample.obs"       "iterations"       "MCMCtheta"       
 [6] "loglikelihood"    "gradient"         "hessian"          "covar"            "failure"         
[11] "network"          "newnetworks"      "newnetwork"       "coef.init"        "est.cov"         
[16] "coef.hist"        "stats.hist"       "steplen.hist"     "control"          "etamap"          
[21] "ergm_version"     "MPLE_is_MLE"      "formula"          "target.stats"     "target.esteq"    
[26] "constrained"      "constraints"      "reference"        "estimate"         "offset"          
[31] "drop"             "estimable"        "null.lik"         "v"                "m"               
[36] "ergm.formula"     "ergm.offset.coef" "egodata"          "ppopsize"         "popsize"         
[41] "ergm.covar"       "DtDe"            
[1] 205
[1] 1

Many of the elements of the object are the same as you would get from an ergm fit, but the last few elements are unique to ergm.ego. Here you can see the ppopsize – the pseudo-population size used to construct the target statistics, and popsize – the final scaled population size after network size adjustment is applied. The values that were used in the fit were the default values, since we did not specify otherwise. So, ppopsize\(=205\) (the sample size, or number of egos), and popsize\(= 1\), so the scaling returns the per capita estimates from the model parameters.

The summary shows the netsize.adj = -5.32301, which is \(-\log(205)\).

The summary function also reports that:

 The following terms are fixed by offset and are not estimated:
  netsize.adj

So what would happen if we fit the model instead with target statistics from a pseudo-population of size 1000? To do this, we explicitly change the value of the ppopsize parameter through the control argument:

Now the netsize.adj value is \(-6.9077553 = -\log(1000)\).

Note that the value of the estimated edges coefficient is the same in both models, 0.698. This is the behavior we expect – the model is returning the same per capita value in both cases; it is just using a different scaling for the target statistics used in the fit. For this simple model, there may not be much difference in the properties of the estimates for these two different pseudo-population sizes.

We will examine the impact of modifying the popsize parameter in a later section below.

As the output shows, the model fit was fit using MCMC. This, too is different from the edges-only model using ergm. For ergm, models with only dyad-dependent terms are fit using Newton-Raphson algorthims (the same algorithm used for logistic regression), not MCMC. For ergm.ego, estimation is alway based on MCMC, regardless of the terms in the model.

Now let’s see what the MCMC diagnostics for this model look like

Sample statistics summary:

Iterations = 16384:4209664
Thinning interval = 1024 
Number of chains = 1 
Sample size per chain = 4096 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean             SD       Naive SE Time-series SE 
       -0.5615        13.8658         0.2167         0.2360 

2. Quantiles for each variable:

 2.5%   25%   50%   75% 97.5% 
  -28   -10    -1     9    27 


Sample statistics cross-correlations:
      edges
edges     1

Sample statistics auto-correlation:
Chain 1 
                 edges
Lag 0     1.0000000000
Lag 1024  0.0863834535
Lag 2048  0.0203495980
Lag 3072  0.0034593660
Lag 4096 -0.0002136637
Lag 5120 -0.0207309074

Sample statistics burn-in diagnostic (Geweke):
Chain 1 

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

 edges 
0.6972 

Individual P-values (lower = worse):
   edges 
0.485665 
Joint P-value (lower = worse):  0.5397549 .


MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

Again, this is a simple model, so the diagnostics suggest good mixing, and even though the distribution of the sample statistic deviations from the targets (on the right panel) can not technically be used to verify that the simulations from the model match the target stats. But the values here suggest they do. We can check that with a call to GOF.

So, it looks good there too.

Finally, we should evaluate the model fit. We can also use GOF to do this, by comparing observed statistics that are not in the model, like the full degree distribution, with simulations from the fitted model. This is the same procedure that we use for ergm, but now with a more limited set of observed higher-order statistics to use for assessment.

And here, finally, we see some bad behavior, but this too is expected from such a simple model. The GOF plot shows there are almost twice as many isolates in the observed data than would be predicted from a simple edges-only model.
Of course we knew this from having looked at the degree distribution plots with the Bernoulli random graph overlay.

Ok, so that’s a full cycle of description, estimation, and model assessment.

From here, let’s try fitting a degree(0) term to see how that changes the degree distribution assessment.


==========================
Summary of model fit
==========================

Formula:   mesa.ego ~ edges + degree(0)

Iterations:  3 out of 20 

Monte Carlo MLE Results:
                    Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj)  -6.9324     0.0000      0    -Inf   <1e-04 ***
edges                 1.1710     0.1030      0  11.366   <1e-04 ***
degree0               1.4896     0.2588      0   5.756   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

So, we’ve now fit the isolates exactly, and the fit is better, but this now suggests there are more nodes with just one tie than would be expected, given the mean degree, and the number of isolates.

And just to round things off, let’s fit a relatively large model. Note that we specify the omitted group for the Race nodefactor term with the levels argument. In general, it’s a good idea to omit the largest group, so that the comparisons have a robust base. That’s group “2” here, Hispanics.


==========================
Summary of model fit
==========================

Formula:   mesa.ego ~ edges + degree(0:1) + nodefactor("Sex") + nodefactor("Race", 
    levels = -2) + nodefactor("Grade") + nodematch("Sex") + nodematch("Race") + 
    nodematch("Grade")

Iterations:  2 out of 20 

Monte Carlo MLE Results:
                      Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj)   -5.32301    0.00000      0    -Inf  < 1e-04 ***
edges                 -1.39016    0.18604      0  -7.472  < 1e-04 ***
degree0                2.13319    0.32843      0   6.495  < 1e-04 ***
degree1                1.02525    0.27915      0   3.673 0.000240 ***
nodefactor.Sex.M      -0.17386    0.05964      0  -2.915 0.003553 ** 
nodefactor.Race.Black  1.20110    0.19252      0   6.239  < 1e-04 ***
nodefactor.Race.NatAm  0.30249    0.05515      0   5.485  < 1e-04 ***
nodefactor.Race.Other -0.87205    0.74036      0  -1.178 0.238852    
nodefactor.Race.White  0.58035    0.12309      0   4.715  < 1e-04 ***
nodefactor.Grade.8     0.14507    0.03989      0   3.637 0.000276 ***
nodefactor.Grade.9     0.15222    0.04603      0   3.307 0.000943 ***
nodefactor.Grade.10    0.32483    0.07051      0   4.607  < 1e-04 ***
nodefactor.Grade.11    0.41138    0.05046      0   8.153  < 1e-04 ***
nodefactor.Grade.12    0.77743    0.08100      0   9.598  < 1e-04 ***
nodematch.Sex          0.64910    0.12658      0   5.128  < 1e-04 ***
nodematch.Race         0.84420    0.12724      0   6.635  < 1e-04 ***
nodematch.Grade        3.05616    0.14400      0  21.223  < 1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

In general the model diagnostics look good. If this were a genuine sample of 205 students from a larger school, we could infer the following:

  • there are still many more isolates, and more degree 1 nodes than expected by chance
  • there are significant differences in mean degree by race, with the dominant group (Hispanics, the reference category) nominating fewer friends that the other groups
  • there may a difference in mean degree for 11th and 12th graders when compared to all the lower grades.
  • there are strong and significant homophily effects, for all three attributes

We will leave this model here and go on to explore how the idea of sampling uncertainty is being used to produce the standard errors for our coefficients.

Note that we have implicitly used simulate here – it’s the basis of the GOF results – but it is also possible to simulate complete networks from this fit:

                      original simulated difference
edges                      203       230        -27
degree0                     57        36         21
degree1                     51        56         -5
nodefactor.Sex.M           171       205        -34
nodefactor.Race.Black       26        28         -2
nodefactor.Race.NatAm      156       198        -42
nodefactor.Race.Other        1         4         -3
nodefactor.Race.White       45        45          0
nodefactor.Grade.8          75        81         -6
nodefactor.Grade.9          65        94        -29
nodefactor.Grade.10         36        37         -1
nodefactor.Grade.11         49        64        -15
nodefactor.Grade.12         28        40        -12
nodematch.Sex              132       151        -19
nodematch.Race             103       113        -10
nodematch.Grade            163       176        -13

We can use network size invariance to simulate networks of a different size, albeit one has to be careful:

                      original simulated difference
edges                      406       451        -45
degree0                    114        97         17
degree1                    102        88         14
nodefactor.Sex.M           342       400        -58
nodefactor.Race.Black       52        56         -4
nodefactor.Race.NatAm      312       356        -44
nodefactor.Race.Other        2         1          1
nodefactor.Race.White       90        88          2
nodefactor.Grade.8         150       151         -1
nodefactor.Grade.9         130       156        -26
nodefactor.Grade.10         72        92        -20
nodefactor.Grade.11         98       138        -40
nodefactor.Grade.12         56        63         -7
nodematch.Sex              264       287        -23
nodematch.Race             206       215         -9
nodematch.Grade            326       357        -31

We only demostrate the functionality briefly here, but this kind of simulation is a powerful way to diagnose structural properties of the fitted model, and to identify and remedy systematic lack of fit.

6d. Example of parameter recovery and sampling based on Faux Magnolia High:

When we estimate parameters based on sampled data, the sampling uncertainty in our estimates comes from the differences in the observations we draw from sample to sample, and the magnitude of uncertainty is a function of our sample size. This is why we typically see something like \(\sqrt{n}\) in the denominator of the standard error of a sample mean or sample proportion. The same principle holds in the context of egocentric network sampling: the standard errors will depend on the number of egos sampled.
This is true despite the fact that we are rescaling first to pseudo-population size, then back down to per capita values. Neither of these influences the estimates of the standard errors – those are influenced only by the size of the egocentric sample.

So let’s use the sample function from ergm.ego to demonstrate this effect. For this section we will use the larger built-in network, faux.magnolia.high.

[1] 1461

Let’s start by fitting an ERGM to the complete network, and looking at the coefficients. We’re going to use a model without any triadic terms, because we want to be able to compare these results to results from egocentrically sampled versions of these data. And we’ll again use the largest race group as the reference level – in this case White (level 6).

Warning: `set_attrs()` is deprecated as of rlang 0.3.0
This warning is displayed once per session.

Egocentric census

Now, suppose we only observe an egocentric view of the data – as an egocentric census.

With an egocentric census, it’s as though we give a survey to all of the students. Each student nominates her friends, but does not report the name of the friend, she only reports their sex, race and grade. How does the fit from ergm.ego to this egocentric census compare to the complete-network ergm estimates?

$egos
  vertex.names Grade  Race Sex
1            1     9 Black   F
2            2    10 Black   M
3            3    12 Black   F
4            4    11  Hisp   F
5            5     9 Black   F
6            6    11 White   M

$alters
  vertex.names Grade  Race Sex
1            1     9 Black   F
2            2     8 White   M
3            2    10 White   M
4            2    10 White   F
5            3    12 Black   F
6            3    12 Black   F

$egoWt
[1] 1 1 1 1 1 1

$egoIDcol
[1] "vertex.names"
Warning: `set_attrs()` is deprecated as of rlang 0.3.0
This warning is displayed once per session.

==========================
Summary of model fit
==========================

Formula:   fmh.ego ~ edges + degree(0) + nodefactor("Race", levels = -6) + 
    nodefactor("Sex") + nodematch("Race") + nodematch("Sex") + 
    absdiff("Grade")

Iterations:  2 out of 20 

Monte Carlo MLE Results:
                      Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj)    0.00000    0.00000      0      NA       NA    
edges                 -6.92853    0.10103      0 -68.580   <1e-04 ***
degree0                0.66270    0.08141      0   8.140   <1e-04 ***
nodefactor.Race.Asian  0.86715    0.13411      0   6.466   <1e-04 ***
nodefactor.Race.Black  0.32013    0.04628      0   6.917   <1e-04 ***
nodefactor.Race.Hisp   0.64033    0.08002      0   8.002   <1e-04 ***
nodefactor.Race.NatAm  1.08327    0.16675      0   6.497   <1e-04 ***
nodefactor.Race.Other  0.73182    0.29970      0   2.442   0.0146 *  
nodefactor.Sex.M      -0.09686    0.03401      0  -2.848   0.0044 ** 
nodematch.Race         1.65937    0.07640      0  21.720   <1e-04 ***
nodematch.Sex          0.84710    0.06143      0  13.791   <1e-04 ***
absdiff.Grade         -2.10927    0.07017      0 -30.058   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

Let’s compare the standard errors from the two fits. Here, we will treat the complete network as the true, fixed population network. Since the data for the second fit is an egocentric census, there is no sampling error in the sufficient statistics for the egocentric model terms.
The only error source of error in the estimation of both models therefore comes from the MCMC algorithm used to approximate the likelihood.

To extract the MCMC error component, we use our se.coef function written above, with the argument src = src = "estimation" (see help(vcov.ergm) for more information).

                 rowname se.coef.net se.coef.census se.ratio
1                  edges       0.007          0.006    0.923
2                degree0       0.003          0.003    0.945
3  nodefactor.Race.Asian       0.009          0.008    0.915
4  nodefactor.Race.Black       0.004          0.004    0.941
5   nodefactor.Race.Hisp       0.008          0.007    0.921
6  nodefactor.Race.NatAm       0.011          0.011    0.960
7  nodefactor.Race.Other       0.020          0.020    0.996
8       nodefactor.Sex.M       0.002          0.002    0.976
9         nodematch.Race       0.006          0.006    0.917
10         nodematch.Sex       0.004          0.004    0.959
11         absdiff.Grade       0.003          0.003    1.001

Both the SEs and the differences in the SEs are very small, with no discernable pattern. If desired, we could reduce the difference in the SEs between the two by increasing the size of our MCMC sample (we’ll leave that as an exercise for you, but we’ll move on for now).

How do the coefficient estimates compare for the two fits?

                 rowname coef.fit.net. coef.fit.census.  Zdiff
1                  edges        -6.950           -6.929 -2.327
2                degree0         0.661            0.663 -0.355
3  nodefactor.Race.Asian         0.908            0.867  3.405
4  nodefactor.Race.Black         0.323            0.320  0.545
5   nodefactor.Race.Hisp         0.683            0.640  3.905
6  nodefactor.Race.NatAm         1.072            1.083 -0.745
7  nodefactor.Race.Other         0.708            0.732 -0.842
8       nodefactor.Sex.M        -0.096           -0.097  0.142
9         nodematch.Race         1.673            1.659  1.624
10         nodematch.Sex         0.855            0.847  1.483
11         absdiff.Grade        -2.115           -2.109 -1.272

To compare the coefficient estimates, we are using a z-score for the difference, based on the MCMC error in each run. While there are some large z-scores, the magnitude of the differences in the coefficients are quite small (in the second decimal place). There is again no discernible pattern here, and we could make the z-score for the difference arbitrarily small by increasing the size of our MCMC sample.

Again, we can diagnose the fitted egocentric model for proper convergence.

Let’s check whether the fitted model can be used to reconstruct the degree distribution.

Warning: `set_attrs()` is deprecated as of rlang 0.3.0
This warning is displayed once per session.

Egocentric sample of the same size

What if we only had an equally large sample, instead of an egocentric census? Here, we take a simple random sample of N students with replacement.


==========================
Summary of model fit
==========================

Formula:   fmh.egosampN ~ edges + degree(0) + nodefactor("Race", levels = -6) + 
    nodefactor("Sex") + nodematch("Race") + nodematch("Sex") + 
    absdiff("Grade")

Iterations:  2 out of 20 

Monte Carlo MLE Results:
                      Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj)    0.00000    0.00000      0      NA       NA    
edges                 -6.85097    0.08890      0 -77.062  < 1e-04 ***
degree0                0.64499    0.07687      0   8.390  < 1e-04 ***
nodefactor.Race.Asian  0.83146    0.12356      0   6.729  < 1e-04 ***
nodefactor.Race.Black  0.29411    0.04652      0   6.322  < 1e-04 ***
nodefactor.Race.Hisp   0.65488    0.08586      0   7.627  < 1e-04 ***
nodefactor.Race.NatAm  1.16423    0.08518      1  13.668  < 1e-04 ***
nodefactor.Race.Other  0.99725    0.34366      0   2.902  0.00371 ** 
nodefactor.Sex.M      -0.14279    0.03585      0  -3.983  < 1e-04 ***
nodematch.Race         1.60130    0.07972      0  20.086  < 1e-04 ***
nodematch.Sex          0.85583    0.05826      0  14.691  < 1e-04 ***
absdiff.Grade         -2.01440    0.06573      0 -30.648  < 1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

The standard errors in this fit represent both MCMC estimation error and sampling error. Let’s compare them to those from the original network fit.

                 rowname se.coef.net se.coef.sampN se.ratio
1                  edges       0.007         0.089   12.872
2                degree0       0.003         0.077   25.386
3  nodefactor.Race.Asian       0.009         0.124   14.093
4  nodefactor.Race.Black       0.004         0.047   11.107
5   nodefactor.Race.Hisp       0.008         0.086   10.692
6  nodefactor.Race.NatAm       0.011         0.085    7.640
7  nodefactor.Race.Other       0.020         0.344   16.818
8       nodefactor.Sex.M       0.002         0.036   14.838
9         nodematch.Race       0.006         0.080   13.045
10         nodematch.Sex       0.004         0.058   15.120
11         absdiff.Grade       0.003         0.066   21.666

So – the error associated with the sampling process is quite large, relative to the MCMC estimation error, even with a sample of the same size as the original network (1461). We can see the impact of the sampling on our race distribution – the smaller the group, the more variable the outcome will be (e.g., the “Other” group here):

      census sampleN
Asian     48      50
Black    261     259
Hisp      68      66
NatAm     24      20
Other      7       4
White   1053    1062

And the coefficient estimates?

                 rowname coef.fit.net. coef.fit.sampN.  Zdiff
1                  edges        -6.950          -6.851 -1.115
2                degree0         0.661           0.645  0.211
3  nodefactor.Race.Asian         0.908           0.831  0.615
4  nodefactor.Race.Black         0.323           0.294  0.624
5   nodefactor.Race.Hisp         0.683           0.655  0.326
6  nodefactor.Race.NatAm         1.072           1.164 -1.076
7  nodefactor.Race.Other         0.708           0.997 -0.842
8       nodefactor.Sex.M        -0.096          -0.143  1.292
9         nodematch.Race         1.673           1.601  0.895
10         nodematch.Sex         0.855           0.856 -0.014
11         absdiff.Grade        -2.115          -2.014 -1.525

No significant differences.

Egocentric 25% sample

What if we have a smaller sample? If we have a 25% sample (0.25*N = 365 students), How will our standard errors be affected?

Let’s take a quick look at the racial breakdown of our sample, and compare it to the original complete net:

      census sample25%
Asian     48        11
Black    261        70
Hisp      68        17
NatAm     24         8
Other      7         1
White   1053       258
Warning: `set_attrs()` is deprecated as of rlang 0.3.0
This warning is displayed once per session.
                 rowname se.coef.sampN se.coef.samp25 se.ratio
1                  edges         0.089          0.167    1.874
2                degree0         0.077          0.133    1.731
3  nodefactor.Race.Asian         0.124          0.180    1.453
4  nodefactor.Race.Black         0.047          0.110    2.357
5   nodefactor.Race.Hisp         0.086          0.128    1.486
6  nodefactor.Race.NatAm         0.085          0.209    2.449
7  nodefactor.Race.Other         0.344          0.359    1.046
8       nodefactor.Sex.M         0.036          0.065    1.820
9         nodematch.Race         0.080          0.136    1.703
10         nodematch.Sex         0.058          0.119    2.043
11         absdiff.Grade         0.066          0.127    1.931

As with ordinary statistics, standard error is inverse-proportional to the square root of the sample size. We’ve divided the sample size by 4, so the standard errors for the smaller sample are about \(\sqrt{4} = 2\) times larger. There’s some variability because the sampling size reduction varied a bit by group.

And the coefficients?

                 rowname coef.fit.net. coef.fit.samp25.  Zdiff
1                  edges        -6.950           -6.725 -1.352
2                degree0         0.661            0.736 -0.562
3  nodefactor.Race.Asian         0.908            0.320  3.269
4  nodefactor.Race.Black         0.323            0.213  1.009
5   nodefactor.Race.Hisp         0.683            0.613  0.547
6  nodefactor.Race.NatAm         1.072            0.934  0.661
7  nodefactor.Race.Other         0.708            1.314 -1.686
8       nodefactor.Sex.M        -0.096           -0.051 -0.694
9         nodematch.Race         1.673            1.505  1.232
10         nodematch.Sex         0.855            0.905 -0.418
11         absdiff.Grade        -2.115           -2.214  0.785

Most differences are not significant. But one (out of 11) is: the coefficient for Asian. So, this particular sample produced a biased estimate. As an exercise, change the seed, and rerun this section, to see what happens with another sample.

Egocentric 25% weighted sample

Finally, suppose that stick with our smaller sample, but modify our sampling design, oversampling the smaller minority groups by a factor of 4. The inclusion probability for a non-white student will now be 4 times greater than in a simple random sample.

Once we weight our sample, we have to take account of the sampling weights in the analysis. ergm.ego can do this automatically for you, as long as the sampling weight is stored in a component of the egodata object called egoWt. If we use the sample function on our egodata object, the probabilities are automatically stored in egoWt.

$egos
     vertex.names Grade  Race Sex
1261         1261     7 White   F
1426         1426     9 White   F
660           660    12 White   M
848           848     8  Hisp   F
671           671    10 White   F
613           613    10 Black   M

$alters
  vertex.names Grade  Race Sex
1            1     9 Black   F
2          1.1     9 Black   F
3         1001    11 White   M
4         1003    12 White   M
5         1003    10 White   F
6         1058     8 White   F

$egoWt
[1] 1.00 1.00 1.00 0.25 1.00 0.25

$egoIDcol
[1] "vertex.names"

Let’s take a quick look at the racial breakdown of our weighted sample, and compare it to the unweighted 25% sample:

      sample25% sampleWtd25%
Asian        11           26
Black        70          141
Hisp         17           25
NatAm         8            9
Other         1            3
White       258          161

Fit the model to the weighted sample.

Warning: `set_attrs()` is deprecated as of rlang 0.3.0
This warning is displayed once per session.

Are the standard errors are improved by this sampling scheme? Note here we are comparing the 25% sample to the 25% weighted sample.

Joining, by = "rowname"
                 rowname se.coef.samp25 se.coef.samp25wtd se.ratio
1                  edges          0.167             0.284    1.702
2                degree0          0.133             0.214    1.607
3  nodefactor.Race.Asian          0.180             0.275    1.531
4  nodefactor.Race.Black          0.110             0.081    0.736
5   nodefactor.Race.Hisp          0.128             0.194    1.524
6  nodefactor.Race.NatAm          0.209             0.127    0.608
7  nodefactor.Race.Other          0.359             0.576    1.603
8       nodefactor.Sex.M          0.065             0.087    1.332
9         nodematch.Race          0.136             0.172    1.268
10         nodematch.Sex          0.119             0.160    1.344
11         absdiff.Grade          0.127             0.167    1.312

The weighting did not improve our SEs.

But how do the coefficient estimates compare?

                 rowname coef.fit.net. coef.fit.samp25. coef.fit.samp25wtd. Zdiff.25wtd.net
1                  edges        -6.950           -6.725              -6.846           0.318
2                degree0         0.661            0.736               0.537          -0.492
3  nodefactor.Race.Asian         0.908            0.320               0.635          -0.832
4  nodefactor.Race.Black         0.323            0.213               0.193          -0.959
5   nodefactor.Race.Hisp         0.683            0.613               0.791           0.464
6  nodefactor.Race.NatAm         1.072            0.934               1.384           1.278
7  nodefactor.Race.Other         0.708            1.314               1.111           0.595
8       nodefactor.Sex.M        -0.096           -0.051              -0.229          -1.217
9         nodematch.Race         1.673            1.505               1.577          -0.438
10         nodematch.Sex         0.855            0.905               0.772          -0.416
11         absdiff.Grade        -2.115           -2.214              -1.904           1.009

But we do see improvements in the estimates – now there are no significant differences from the true values.

7. Ongoing Developments

The package is under active development on GitHub. The public release (on CRAN) is in the repository statnet/ergm.ego–this is the place to go to report bugs or request features. We also have a non-public development branch for working on new features and bugfixes. If you are interested in contributing to the development of ergm.ego, please contact us through the GitHub interface.

The following features are currently implemented in the GitHub development branch, and will be released in the next package update:

  • Finite population correction.

  • Support for more complex sampling designs, integrated with R package survey.

  • Support for directed relations.

  • Support for main effects of covariates measured only on the egos.

  • Support for alter–alter nominations, integrated with egonetR.

Additional functionality is planned in the near future:

  • Support for automatic fitting of tergms.

8. References

Krivitsky, P. N., & Morris, M. (2017). Inference for social network models from egocentrically sampled data, with application to understanding persistent racial disparities in HIV prevalence in the US. Annals of Applied Statistics, 11(1), 427-455.

Krivitsky, Pavel N.; Kolaczyk, Eric D. On the Question of Effective Sample Size in Network Modeling: An Asymptotic Inquiry. Statist. Sci. 30 (2015), no. 2, 184–198. doi:10.1214/14-STS502. https://projecteuclid.org/euclid.ss/1433341477

Morris, M., Kurth, A. E., Hamilton, D. T., Moody, J., & Wakefield, S. (2009). Concurrent partnerships and HIV prevalence disparities by race: linking science and public health practice. American Journal of Public Health, 99(6), 1023-1023.

We are working on a paper describing the package for the Journal of Statistical Software with a target submission date in 2019.


  1. This does not mean that the mean degree itself cannot be estimated from egocentric data, only that our inferential results might not apply.